Topic Detection in Read Documents

Authors

  • Rui Amaral
  • Isabel Trancoso
Abstract

In this paper, we address the importance of topic annotation in the speech retrieval domain and the problems it involves. Having identified the problem, we describe an algorithm developed to perform automatic topic annotation of broadcast news (BN) speech corpora. The approach is based on Hidden Markov Models (HMMs) and topic language models, and solves the topic segmentation and labelling tasks simultaneously. To overcome the lack of topic-labelled material for training the statistical models, a two-stage unsupervised clustering procedure was developed. Both stages are based on the nearest-neighbour search method, using the Kullback-Leibler divergence as a distance measure. Ongoing experiments to evaluate the system's performance are also described.
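
The abstract gives no implementation details for the clustering, so the following is only a minimal sketch of one nearest-neighbour clustering pass that uses a Kullback-Leibler based distance. Everything specific here is an assumption made for illustration and not the authors' method: the function names, the add-alpha smoothed unigram models, the symmetrised form of the divergence, the single pass (the paper describes two stages), and the threshold value.

```python
import math
from collections import Counter


def unigram_distribution(tokens, vocabulary, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary (assumed model)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / total for word in vocabulary}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); smoothing keeps every q[word] > 0."""
    return sum(p[word] * math.log(p[word] / q[word]) for word in p)


def symmetric_kl(p, q):
    """Symmetrised KL, used as the distance (an assumption; plain KL is asymmetric)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def nearest_neighbour_clustering(stories, threshold):
    """Assign each story to its nearest cluster, or open a new cluster.

    stories: list of token lists. Returns one cluster index per story.
    A story joins the nearest cluster only if the distance is below `threshold`
    (a hypothetical tuning parameter).
    """
    vocabulary = sorted({word for story in stories for word in story})
    clusters = []      # pooled tokens of each cluster
    assignments = []
    for story in stories:
        p = unigram_distribution(story, vocabulary)
        best_index, best_distance = None, float("inf")
        for index, pooled_tokens in enumerate(clusters):
            q = unigram_distribution(pooled_tokens, vocabulary)
            distance = symmetric_kl(p, q)
            if distance < best_distance:
                best_index, best_distance = index, distance
        if best_index is not None and best_distance < threshold:
            clusters[best_index].extend(story)
            assignments.append(best_index)
        else:
            clusters.append(list(story))
            assignments.append(len(clusters) - 1)
    return assignments


if __name__ == "__main__":
    toy_stories = [
        ["goal", "match", "team"],
        ["match", "team", "goal", "win"],
        ["bank", "rate", "market"],
        ["market", "bank", "loan"],
    ]
    print(nearest_neighbour_clustering(toy_stories, threshold=0.3))
```

With these toy stories and this threshold, the two sports stories end up in one cluster and the two finance stories in another. On real broadcast news transcripts the vocabulary, smoothing and threshold would all need tuning.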

Similar resources

A review of text mining approaches and their function in discovering and extracting a topic

Background and aim: Four text mining methods are examined and focused on understanding and identifying their properties and limitations in subject discovery. Methodology: The study is an analytical review of the literature of text mining and topic modeling.  Findings: LSA could be used to classify specific and unique topics in documents that address only a single topic. The other three text min...

Motivation to Read in a Second Language: A Review of Literature

Reading motivation is a well-researched topic in relation to first language literacy development due to its influence on both reading processes and outcomes. In second language reading, the role of motivation has not been as thoroughly explored. The aim of this review of literature is to highlight established studies as well as recent explorations in some recurring areas of first and second lan...

Detection of Topic and its Extrinsic Evaluation Through Multi-Document Summarization

This paper presents a method for detecting words related to a topic (we call them topic words) over time in the stream of documents. Topic words are widely distributed in the stream of documents, and sometimes they frequently appear in the documents, and sometimes not. We propose a method to reinforce topic words with low frequencies by collecting documents from the corpus, and applied Latent D...

Clustering-Based Searching and Navigation in an Online News Source

The growing amount of online news posted on the WWW demands new algorithms that support topic detection, search, and navigation of news documents. This work presents an algorithm for topic detection that considers the temporal evolution of news and the structure of web documents. Then, it uses the results of the topic detection algorithm for searching and navigating in an online news source. An...

A Probabilistic Topic Model Based on Local Word Relationships in Overlapping Windows

A probabilistic topic model assumes that documents are generated through a process involving topics and then tries to reverse this process, given the documents and extract topics. A topic is usually assumed to be a distribution over words. LDA is one of the first and most popular topic models introduced so far. In the document generation process assumed by LDA, each document is a distribution o...


Publication date: 2000